Cobra: Content-based Filtering and Aggregation of Blogs and RSS Feeds
Abstract
Blogs and RSS feeds are becoming increasingly popular. The blogging site LiveJournal has over 11 million user accounts, and according to one report, over 1.6 million postings are made to blogs every day. The “Blogosphere” is a new hotbed of Internet-based media that represents a shift from mostly static content to dynamic, continuously updated discussions. The problem is that finding and tracking blogs with interesting content is an extremely cumbersome process. In this paper, we present Cobra (Content-Based RSS Aggregator), a system that crawls, filters, and aggregates vast numbers of RSS feeds, delivering to each user a personalized feed based on their interests. Cobra consists of a three-tiered network of crawlers that scan web feeds, filters that match crawled articles to user subscriptions, and reflectors that provide recently matching articles on each subscription as an RSS feed, which can be browsed using a standard RSS reader. We present the design, implementation, and evaluation of Cobra in three settings: a dedicated cluster, the Emulab testbed, and PlanetLab. We present a detailed performance study of the Cobra system, demonstrating that the system scales well to support a large number of source feeds and users; that the mean update detection latency is low (bounded by the crawler rate); and that an offline service provisioning step, combined with several performance optimizations, is effective at reducing memory usage and network load.
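The three tiers named in the abstract (crawlers, filters, reflectors) can be illustrated with a minimal single-process sketch. This is not Cobra's implementation: the names `Article`, `crawl`, `matches`, `Reflector`, and `run_pipeline` are hypothetical, the feeds are in-memory data rather than fetched RSS documents, and the matching rule (every subscription keyword must appear in the article text) is an assumption made only for illustration.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Article:
    feed_url: str
    title: str
    body: str


def crawl(feeds):
    """Crawler tier (sketch): yield every article from each source feed.

    `feeds` maps a feed URL to a list of (title, body) pairs. A real
    crawler would fetch and parse the feeds over HTTP instead.
    """
    for url, items in feeds.items():
        for title, body in items:
            yield Article(url, title, body)


def matches(article, keywords):
    """Filter tier (sketch): assumed rule, all keywords must occur."""
    text = f"{article.title} {article.body}".lower()
    return all(keyword.lower() in text for keyword in keywords)


class Reflector:
    """Reflector tier (sketch): keep recent matches per subscription."""

    def __init__(self, max_items=10):
        # Each subscription gets a bounded queue; old matches fall off.
        self.recent = defaultdict(lambda: deque(maxlen=max_items))

    def publish(self, subscription_id, article):
        self.recent[subscription_id].append(article)

    def feed(self, subscription_id):
        # Return a copy so callers cannot mutate the stored queue.
        return list(self.recent.get(subscription_id, ()))


def run_pipeline(feeds, subscriptions, reflector):
    """Push every crawled article through the filter to the reflector."""
    for article in crawl(feeds):
        for subscription_id, keywords in subscriptions.items():
            if matches(article, keywords):
                reflector.publish(subscription_id, article)
```

Calling `run_pipeline(feeds, subscriptions, reflector)` and then `reflector.feed(subscription_id)` returns the retained matches for that subscription, which a real reflector would serialize as an RSS feed. In the system the abstract describes, the tiers are spread across a network of hosts; the sketch collapses them into one process to show only the data flow.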
Similar resources
RoSeS: A Continuous Content-Based Query Engine for RSS Feeds
In this article we present RoSeS (Really Open Simple and Efficient Syndication), a generic framework for content-based RSS feed querying and aggregation. RoSeS is based on a data-centric approach, using a combination of standard database concepts like declarative query languages, views and multiquery optimization. Users create personalized feeds by defining and composing content-based filtering...
Optimizing large collections of continuous content-based RSS aggregation queries
In this article we present RoSeS (Really Open Simple and Efficient Syndication), a generic framework for content-based RSS feed querying and aggregation. RoSeS is based on a data-centric approach, using a combination of standard database concepts like declarative query languages, views and multi-query optimization. Users create personalized feeds by defining and composing content-based filterin...
Best-Effort Refresh Strategies for Content-Based RSS Feed Aggregation
During the past several years, RSS-based content syndication has become a standard technique for disseminating information on the web efficiently and in a timely manner. From a data processing perspective, RSS feeds are standard XML resources which are periodically refreshed by feed aggregators for generating continuous streams of items. In this article, we study the problem of information loss in the contex...
Foafing the Music: Bridging the Semantic Gap in Music Recommendation
In this paper we give an overview of the Foafing the Music system. The system uses the Friend of a Friend (FOAF) and RDF Site Summary (RSS) vocabularies for recommending music to a user, depending on the user’s musical tastes and listening habits. Music information (new album releases, podcast sessions, audio from MP3 blogs, related artists’ news and upcoming gigs) is gathered from thousands of...
RSS Feed Recommendation
Introduction: Really Simple Syndication (RSS) feeds allow users to access blogs and articles in an easy-to-read format. They cut out the overhead of navigating websites for content and allow users to get information more quickly. Currently, the user is in total control of their RSS feeds, adding and deleting feeds according to their tastes. This requires the user to actively search out RSS feed...